Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#3307: duplicate key bug for public contact #3362

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

zandercymatics
Copy link
Contributor

@zandercymatics zandercymatics commented Jan 17, 2025

Ticket

Resolves #3307

Changes

  • Adds a try/catch and transaction.atomic() block in the get_or_create_public_contact function
  • Adds a unit test

Context for reviewers

As far as I can tell, this appears to be a race condition (notes). In both instances of this bug occurring, get_contact_in_keys was called twice with a very small time interval between. Right after this occurs - we get this fail condition. In addition, it seems like our app was undergoing heavy load at this same exact time (use the view surrounding documents button).

The check in _get_or_create_public_contact is very concrete in that we just return the record if one exists but does not modify it. This is fine sequentially, but it's vulnerable to a common "check-then-act" type of race condition. Consider this sequence:

  1. Thread A checks if contact exists (returns empty queryset)
  2. Thread B checks if contact exists (returns empty queryset)
  3. Thread A creates contact
  4. Thread B attempts to create contact => IntegrityError

In other words, it appears as though the only code path that can cause an integrity error is if the filter return an empty queryset at just the right time. I've included a unit test for this, and it reliably reproduces the bug without my fix and the test passes with it.

I've directly tested (modified code for) two alternative ideas:

  1. The get_id on public_contact.py function returns the same id.
    a. Sequentially, the probability of collision is really low here -- roughly 10^25. random.choices() isn't uniformly distributed though like uuid, but the probability space is so high that I find this option less convincing.
    b. In parallel (threads), random is thread-safe. Pythons docs note concurrency issues when using a free-threaded build (so without the GIL), but we do not use that build.
  2. The interaction between _get_or_create_public_contact and _add_missing_contacts_if_unknown creates duplicates because .save() is called twice on the same record.
    a. It is worth noting that _add_missing_contacts_if_unknown calls save, but doesn't check for duplicates like get_or_create does.
    b. This doesn't seem consistent with our log stack and is in line with what addAllDefaults does.
    c. Sequentially, I couldn't find a code path for an id conflict to occur because this error happens within _get_or_create_public_contact (which is called before add_missing_contacts) AND we have a check for its existence before attempting a save.

Ultimately though, we are missing just enough logging to prevent a more exhaustive search here. I've tried to add it to places that I think are important.

Setup

This is fairly tricky to test as this bug is somewhat rare in the first place. I've included a test case that is able to reliably reproduce the issue though - and that might be a good place to test this.

To cause this error, do the following in _get_or_create_public_contact:
Comment out this code:

            try:
                with transaction.atomic():
                    public_contact.save(skip_epp_save=True)
                    logger.info(f"Created a new PublicContact: {public_contact}")
            except IntegrityError as err:
                logger.error(
                    f"_get_or_create_public_contact() => tried to create a duplicate public contact: {err}",
                    exc_info=True,
                )
                return PublicContact.objects.get(
                    registry_id=public_contact.registry_id,
                    contact_type=public_contact.contact_type,
                    domain=self,
                )

And replace it with this (or the old code):

public_contact.save(skip_epp_save=True)

Then run the included unit test. You should reliably see an integrity error as per the original issue description.

Code Review Verification Steps

As the original developer, I have

Satisfied acceptance criteria and met development standards

  • Met the acceptance criteria, or will meet them in a subsequent PR
  • Created/modified automated tests
  • Update documentation in READMEs and/or onboarding guide

Ensured code standards are met (Original Developer)

  • If any updated dependencies on Pipfile, also update dependencies in requirements.txt.
  • Interactions with external systems are wrapped in try/except
  • Error handling exists for unusual or missing values

Validated user-facing changes (if applicable)

  • Tag @dotgov-designers in this PR's Reviewers for design review. If code is not user-facing, delete design reviewer checklist
  • Verify new pages have been added to .pa11yci file so that they will be tested with our automated accessibility testing
  • Checked keyboard navigability
  • Tested general usability, landmarks, page header structure, and links with a screen reader (such as Voiceover or ANDI)

As a code reviewer, I have

Reviewed, tested, and left feedback about the changes

  • Pulled this branch locally and tested it
  • Verified code meets all checks above. Address any checks that are not satisfied
  • Reviewed this code and left comments. Indicate if comments must be addressed before code is merged
  • Checked that all code is adequately covered by tests
  • Verify migrations are valid and do not conflict with existing migrations

Validated user-facing changes as a developer

Note: Multiple code reviewers can share the checklists above, a second reviewer should not make a duplicate checklist. All checks should be checked before approving, even those labeled N/A.

  • New pages have been added to .pa11yci file so that they will be tested with our automated accessibility testing
  • Checked keyboard navigability
  • Meets all designs and user flows provided by design/product
  • Tested general usability, landmarks, page header structure, and links with a screen reader (such as Voiceover or ANDI)
  • (Rarely needed) Tested as both an analyst and applicant user

As a designer reviewer, I have

Verified that the changes match the design intention

  • Checked that the design translated visually
  • Checked behavior. Comment any found issues or broken flows.
  • Checked different states (empty, one, some, error)
  • Checked for landmarks, page heading structure, and links

Validated user-facing changes as a designer

  • Checked keyboard navigability
  • Tested general usability, landmarks, page header structure, and links with a screen reader (such as Voiceover or ANDI)
  • Tested with multiple browsers (check off which ones were used)
    • Chrome
    • Microsoft Edge
    • FireFox
    • Safari
  • (Rarely needed) Tested as both an analyst and applicant user

References

Screenshots

Copy link

🥳 Successfully deployed to developer sandbox za.

Copy link

🥳 Successfully deployed to developer sandbox za.

@zandercymatics zandercymatics changed the title [DRAFT] #3307: duplicate key bug for public contact #3307: duplicate key bug for public contact Jan 17, 2025
Copy link

🥳 Successfully deployed to developer sandbox za.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Duplicate key issue in PublicContact
1 participant